Web Scraping & Data Analysis with Selenium and Python

Author: Vinay Babu

GitHub: https://github.com/min2bro/WebScrapingwithSelenium

Twitter: @min2bro

IPython Notebook

  • Write, edit, and replay Python scripts
  • Interactive data visualization and report presentation
  • Notebooks can be saved and shared
  • Run Selenium Python scripts

Pandas

Matplotlib

Analysis of the Filmfare Awards for Best Picture from 1955-2015

Web Scraping: Extracting Data from the Web


In [111]:
%matplotlib inline 
from selenium import webdriver
import os,time,json
import pandas as pd
from collections import defaultdict,Counter
import matplotlib.pyplot as plt

In [92]:
url = "http://www.imdb.com/list/ls061683439/"
with open('./filmfare.json',encoding="utf-8") as f:
    datatbl = json.load(f)
driver = webdriver.Chrome(datatbl['data']['chromedriver'])
driver.get(url)
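
The cell above reads the ChromeDriver path and every XPath expression from filmfare.json, which is not shown in the notebook. A minimal sketch of how such a file might be laid out, using only the key names the cells reference; every value is a hypothetical placeholder, and 'Genre' is assumed to be a list of genre labels:

```python
import json

# Key names come from the notebook's cells; every value here is a
# hypothetical placeholder, not the author's real configuration.
sample_config = {
    "data": {
        "chromedriver": "/path/to/chromedriver",
        "listview": "//placeholder/xpath/for/list-view/button",
        "Movies_Name_Xpath": "//placeholder/xpath/for/titles",
        "Movies_Rate_Xpath": "//placeholder/xpath/for/ratings",
        "Movies_Runtime_Xpath": "//placeholder/xpath/for/runtimes",
        "Movies_Votes_Xpath": "//placeholder/xpath/for/votes",
        "Genre": ["Drama", "Musical", "Drama"],  # assumed shape: one label per film
    }
}

# Round-trip through JSON text to mimic json.load() on the real file.
datatbl_example = json.loads(json.dumps(sample_config))
print(sorted(datatbl_example["data"]))
```

Keeping the selectors in a separate file means a change in the page layout only requires editing the JSON, not the scraping code.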

Getting Data


In [93]:
def ExtractText(Xpath):
    # Collect the text of every element matched by the XPath stored under this key
    elements = driver.find_elements_by_xpath(datatbl['data'][Xpath])
    if Xpath == "Movies_Runtime_Xpath":
        # Keep only the three characters that hold the run time
        return [item.text[-10:-7] for item in elements]
    return [item.text for item in elements]
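
The find_elements_by_xpath method used above was deprecated in Selenium 4 and removed in later 4.x releases. A hedged sketch of an equivalent helper for newer versions follows. The names extract_text, FakeElement and FakeDriver are illustrative, and the driver and XPath table are passed in explicitly so the function can be tried without a browser:

```python
def extract_text(driver, xpaths, key):
    """Return the text of every element matched by xpaths[key]."""
    # "xpath" is the value of Selenium's By.XPATH constant; using the plain
    # string keeps this sketch importable without Selenium installed.
    elements = driver.find_elements("xpath", xpaths[key])
    if key == "Movies_Runtime_Xpath":
        # Same slice as the notebook's ExtractText
        return [el.text[-10:-7] for el in elements]
    return [el.text for el in elements]


# Stand-in objects (hypothetical) so the helper can run without launching Chrome.
class FakeElement:
    def __init__(self, text):
        self.text = text


class FakeDriver:
    def find_elements(self, by, value):
        return [FakeElement("Queen"), FakeElement("Barfi!")]


titles = extract_text(FakeDriver(), {"Movies_Name_Xpath": "//placeholder"}, "Movies_Name_Xpath")
print(titles)  # ['Queen', 'Barfi!']
```

With a real Selenium 4 driver, the same call would be written as driver.find_elements(By.XPATH, ...) after importing By from selenium.webdriver.common.by.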

In [94]:
#Extracting Data from Web
Movies_Votes,Movies_Name,Movies_Ratings,Movies_RunTime=[[] for i in range(4)]
datarepo = [[]]*4
Xpath_list = ['Movies_Name_Xpath','Movies_Rate_Xpath','Movies_Runtime_Xpath','Movies_Votes_Xpath']

for i in range(4):
    if(i==3):
        driver.find_element_by_xpath(datatbl['data']['listview']).click()
    datarepo[i] = ExtractText(Xpath_list[i])
    
driver.quit()
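
One subtlety in the cell above: `[[]]*4` creates four references to a single list rather than four separate lists. The loop is unaffected because it rebinds each slot with `datarepo[i] = ...`, but an in-place `append` would show up in every slot. A short illustration:

```python
shared = [[]] * 4                     # four references to ONE list
shared[0].append("x")                 # mutating one slot mutates them all
print(shared)                         # [['x'], ['x'], ['x'], ['x']]

independent = [[] for _ in range(4)]  # four distinct lists
independent[0].append("x")
print(independent)                    # [['x'], [], [], []]

rebound = [[]] * 4
rebound[0] = ["x"]                    # rebinding a slot, as the notebook does, is safe
print(rebound)                        # [['x'], [], [], []]
```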

Store Data in a Python Dictionary


In [95]:
# Result in a Python Dictionary
Years=range(2015,1954,-1)
result = defaultdict(dict)
for i in range(0,len(datarepo[0])):
    result[i]['Movie Name']= datarepo[0][i]
    result[i]['Year']= Years[i]
    result[i]['Rating']= datarepo[1][i]
    result[i]['Votes']= datarepo[3][i]
    result[i]['RunTime']= datarepo[2][i]
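
The index-based loop above can also be written with `zip`, which pairs the scraped lists with the year sequence and stops at the shortest one. A minimal sketch using two rows from the notebook's output; `names`, `ratings`, `runtimes` and `votes` are illustrative stand-ins for the entries of `datarepo`, and the comma in the raw vote strings is an assumption based on the clean-up step:

```python
names = ["Bajirao Mastani", "Queen"]
ratings = ["7.2", "8.4"]
runtimes = ["158", "146"]
votes = ["17,265", "39,470"]  # assumed raw format with thousands separators
years = range(2015, 1954, -1)

# One record per film, keyed by its position in the scraped list
records = {
    i: {"Movie Name": n, "Year": y, "Rating": r, "Votes": v, "RunTime": t}
    for i, (n, y, r, v, t) in enumerate(zip(names, years, ratings, votes, runtimes))
}
print(records[1])
```

Because `zip` truncates silently, comparing the list lengths first is a cheap way to catch a selector that matched fewer elements than expected.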

Data before Clean Up


In [96]:
# Dictionary Result
print(json.dumps(result[59], indent=2))


{
  "RunTime": "149",
  "Movie Name": "Boot Polish",
  "Votes": "498",
  "Rating": "8.1",
  "Year": 1956
}

Clean Data


In [97]:
for key,values in result.items():
    values['Votes'] = int(values['Votes'].replace(",",""))
    values['Rating']= float(values['Rating'])
    try:
        values['RunTime'] = int(values['RunTime'])
    except ValueError:
        values['RunTime'] = 0
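
The clean-up rules can be factored into a small helper that is easy to test in isolation. This is a sketch, not the author's code: `to_int` and its `default` parameter are hypothetical names. It returns `None` for unparseable input by default, which avoids treating a missing run time as a real 0-minute film; passing `default=0` reproduces the notebook's behaviour.

```python
def to_int(text, default=None):
    """Parse strings such as '17,265' into an int; return default on failure."""
    try:
        return int(text.replace(",", "").strip())
    except ValueError:
        return default


print(to_int("17,265"))        # 17265
print(to_int("158"))           # 158
print(to_int(""))              # None
print(to_int("", default=0))   # 0
```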

In [98]:
print(json.dumps(result[0], indent=2))


{
  "RunTime": 158,
  "Movie Name": "Bajirao Mastani",
  "Votes": 17265,
  "Rating": 7.2,
  "Year": 2015
}

What the Data in the Dictionary Looks Like


In [99]:
# Dictionary Result
print(json.dumps(result[0], indent=2))


{
  "RunTime": 158,
  "Movie Name": "Bajirao Mastani",
  "Votes": 17265,
  "Rating": 7.2,
  "Year": 2015
}

Data in Pandas Dataframe


In [100]:
# create dataframe
df = pd.DataFrame.from_dict(result,orient='index')
df.sort_values(by='Year',ascending=True,inplace=True)
df = df[['Year', 'Movie Name', 'Rating', 'Votes','RunTime']]
df


Out[100]:
    Year  Movie Name                         Rating  Votes   RunTime
60  1955  Do Bigha Zamin                     8.4     1104    131
59  1956  Boot Polish                        8.1     498     149
58  1957  Jagriti                            7.8     82      0
57  1958  Jhanak Jhanak Payal Baaje          7.3     68      143
56  1959  Mother India                       8.1     4841    172
55  1960  Madhumati                          8.1     792     110
54  1961  Sujata                             7.5     191     161
53  1962  Mughal-E-Azam                      8.4     3868    197
52  1963  Jis Desh Men Ganga Behti Hai       7.3     301     167
51  1964  Sahib Bibi Aur Ghulam              8.4     1076    152
50  1965  Bandini                            7.8     546     157
49  1966  Dosti                              8.4     1009    163
48  1967  Himalay Ki Godmein                 7.2     57      0
47  1968  Guide                              8.6     3950    183
46  1969  Upkar                              7.7     363     175
45  1970  Brahmachari                        6.8     239     157
44  1971  Aradhana                           7.7     1091    169
43  1972  Toy                                7.3     229     160
42  1973  Anand                              8.9     10716   122
41  1974  Be-Imaan                           7.4     57      133
40  1975  Anuraag                            7.4     45      0
39  1976  Rajnigandha                        7.5     315     110
38  1977  Deewaar                            8.2     5567    174
37  1978  Mausam                             8.1     563     156
36  1979  Bhumika                            7.6     311     142
35  1980  Main Tulsi Tere Aangan Ki          7.4     75      151
34  1981  Junoon                             7.6     341     141
33  1982  Khubsoorat                         7.8     937     126
32  1983  Kalyug                             7.8     369     152
31  1984  Shakti                             7.9     1347    166
..  ...   ...                                ...     ...     ...
29  1986  Sparsh                             8.1     380     145
28  1987  Ram Teri Ganga Maili               6.8     685     178
27  1988  Qayamat Se Qayamat Tak             7.6     6159    162
26  1989  Maine Pyar Kiya                    7.5     5798    192
25  1990  Ghayal                             7.6     2630    163
24  1991  Lamhe                              7.4     1928    187
23  1992  Jo Jeeta Wohi Sikandar             8.3     12283   174
22  1993  Hum Hain Rahi Pyar Ke              7.5     3442    163
21  1994  Hum Aapke Hain Koun...!            7.7     10943   206
20  1995  Dilwale Dulhania Le Jayenge        8.3     42046   189
19  1996  Raja Hindustani                    6.1     4801    165
18  1997  Dil To Pagal Hai                   7.1     13886   179
17  1998  Kuch Kuch Hota Hai                 7.8     31472   177
16  1999  Straight from the Heart            7.6     9792    188
15  2000  Kaho Naa... Pyaar Hai              6.9     7870    172
14  2001  Lagaan: Once Upon a Time in India  8.2     68775   224
13  2002  Devdas                             7.6     25156   185
12  2003  Koi... Mil Gaya                    7.1     12001   171
11  2004  Veer-Zaara                         7.9     33458   192
10  2005  Black                              8.3     22847   122
9   2006  Rang De Basanti                    8.4     68536   157
8   2007  Like Stars on Earth                8.5     82632   165
7   2008  Jodhaa Akbar                       7.6     17898   213
6   2009  3 Idiots                           8.4     200886  170
5   2010  Dabangg                            6.3     19766   126
4   2011  Zindagi Na Milegi Dobara           8.1     41695   155
3   2012  Barfi!                             8.2     52256   151
2   2013  Bhaag Milkha Bhaag                 8.3     39674   186
1   2014  Queen                              8.4     39470   146
0   2015  Bajirao Mastani                    7.2     17265   158

61 rows × 5 columns
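
`DataFrame.from_dict(..., orient='index')` treats each outer key as a row label and each inner dictionary as that row's columns. A small sketch with two records from the table above, trimmed to three columns for brevity:

```python
import pandas as pd

records = {
    0: {"Movie Name": "Bajirao Mastani", "Year": 2015, "Rating": 7.2},
    1: {"Movie Name": "Queen", "Year": 2014, "Rating": 8.4},
}

# Outer keys (0, 1) become the index; inner keys become the columns
small = pd.DataFrame.from_dict(records, orient="index")

# Sort by year, then put the columns in display order
small = small.sort_values(by="Year")[["Year", "Movie Name", "Rating"]]
print(small)
```

Sorting keeps the original row labels, which is why the index in the full table runs from 60 down to 0.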

Movies with Highest Ratings


In [101]:
#Highest Rating Movies
df.sort_values('Rating',ascending=[False]).head(5)


Out[101]:
    Year  Movie Name                         Rating  Votes   RunTime
42  1973  Anand                              8.9     10716   122
47  1968  Guide                              8.6     3950    183
8   2007  Like Stars on Earth                8.5     82632   165
60  1955  Do Bigha Zamin                     8.4     1104    131
53  1962  Mughal-E-Azam                      8.4     3868    197
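
`DataFrame.nlargest` is a compact alternative to sorting and taking the head. A sketch on a few rows from the table; when several rows tie on the cut-off value (as the 8.4 ratings do in the full data), which of them appear depends on the `keep` argument, whose default is `'first'`:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Movie Name": ["Dabangg", "Guide", "Anand", "Like Stars on Earth"],
        "Rating": [6.3, 8.6, 8.9, 8.5],
    }
)

# Three highest-rated rows, returned in descending order of Rating
top3 = sample.nlargest(3, "Rating")
print(top3)
```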

Movies with Maximum Run time


In [119]:
#Movies with maximum Run Time
df.sort_values(['RunTime'],ascending=[False]).head(10)


Out[119]:
    Year  Movie Name                         Rating  Votes   RunTime
14  2001  Lagaan: Once Upon a Time in India  8.2     68775   224
7   2008  Jodhaa Akbar                       7.6     17898   213
21  1994  Hum Aapke Hain Koun...!            7.7     10943   206
53  1962  Mughal-E-Azam                      8.4     3868    197
11  2004  Veer-Zaara                         7.9     33458   192
26  1989  Maine Pyar Kiya                    7.5     5798    192
20  1995  Dilwale Dulhania Le Jayenge        8.3     42046   189
16  1999  Straight from the Heart            7.6     9792    188
24  1991  Lamhe                              7.4     1928    187
2   2013  Bhaag Milkha Bhaag                 8.3     39674   186

Best Picture Run time


In [113]:
# Pass column names rather than Series objects
df.plot(x='Year', y='RunTime');


Best Picture Ratings


In [80]:
# Number of movies rated 7 or higher
df[(df['Rating']>=7)]['Rating'].count()


Out[80]:
56

In [81]:
# Create rating graph: count the movies in each rating band
Rating_Hist = {}

Rating_Hist['Btwn 6 & 7'] = df[(df['Rating']>=6)&(df['Rating']<7)]['Rating'].count()
Rating_Hist['Btwn 7 & 8'] = df[(df['Rating']>=7)&(df['Rating']<8)]['Rating'].count()
Rating_Hist['GTEQ 8'] = df[(df['Rating']>=8)]['Rating'].count()

# Wrap the dict views in list() so matplotlib receives plain sequences
plt.bar(range(len(Rating_Hist)), list(Rating_Hist.values()), align='center',color='brown',width=0.4)
plt.xticks(range(len(Rating_Hist)), list(Rating_Hist.keys()), rotation=25);
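
The three hand-written filters can also be expressed as a single binning step with `pandas.cut`. A sketch on a handful of ratings from the table; the bin edges mirror the cell above, and the upper edge of 10 is an assumption about the maximum possible rating:

```python
import pandas as pd

ratings = pd.Series([8.9, 7.2, 6.3, 8.4, 7.6, 6.8, 8.0])

buckets = pd.cut(
    ratings,
    bins=[6, 7, 8, 10],
    right=False,  # intervals are [6, 7), [7, 8), [8, 10)
    labels=["Btwn 6 & 7", "Btwn 7 & 8", "GTEQ 8"],
)

# sort=False keeps the bands in bin order instead of ordering by count
counts = buckets.value_counts(sort=False)
print(counts)
```

With `right=False` a rating of exactly 10 would fall outside the last bin, so a slightly larger upper edge is needed if such values can occur.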



In [82]:
# Average movie run time in minutes
df['RunTime'].mean()


Out[82]:
154.2622950819672
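
Because the clean-up step stores 0 for films whose run time could not be parsed, those placeholder rows pull the mean down. A sketch of excluding them, using four rows from the table, one of which carries the 0 placeholder:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Movie Name": ["Do Bigha Zamin", "Boot Polish", "Jagriti", "Jhanak Jhanak Payal Baaje"],
        "RunTime": [131, 149, 0, 143],
    }
)

mean_all = sample["RunTime"].mean()                               # 105.75
mean_known = sample.loc[sample["RunTime"] > 0, "RunTime"].mean()  # 141.0
print(mean_all, mean_known)
```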

Best Picture by Genre


In [83]:
# Movies by Genre
Category=Counter(datatbl['data']['Genre'])
df1 = pd.DataFrame.from_dict(Category,orient='index')
df1 = df1.sort_values([0],ascending=[False]).head(5)
df1.plot(kind='barh',color=['g','c','m']);
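
`Counter` tallies how often each label occurs, and `most_common(n)` returns the top `n` as (label, count) pairs, which gives the same top-N view without a DataFrame. The genre list below is made up for illustration; the real labels live in filmfare.json.

```python
from collections import Counter

# Hypothetical genre labels, one per film
genres = ["Drama", "Musical", "Drama", "Romance", "Drama", "Musical", "Action"]

top = Counter(genres).most_common(2)
print(top)  # [('Drama', 3), ('Musical', 2)]
```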



In [224]:
# Votes vs. rating
# Pass column names, not Series: x=df.Votes raised "IndexError: indices are out-of-bounds"
df.plot(x='Votes', y='Rating', kind='scatter');

In [84]:
# Votes vs. rating on a log-scaled x axis, since vote counts span several orders of magnitude
df.plot(kind='scatter', x='Votes', y='Rating', logx=True, alpha=0.5, color='purple', edgecolor='none')
plt.ylabel('IMDB Rating')
plt.xlabel('Number of Votes')
plt.show()


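Conclusion:

  • Most Best Picture winners have a rating of 7 or higher (56 of the 61 films)
  • Most run longer than 2 hours (the mean run time is about 154 minutes)
  • Films in the Drama and Musical categories are the most likely to be selected for Best Picture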
